San Diego Burrito Analytics: Linear models

Scott Cole

21 May 2016

This notebook attempts to predict the overall rating of a burrito as a linear combination of its dimensions. Interpretation of these models is complicated by the substantial correlations between dimensions (e.g., between meat quality and non-meat filling quality).

Imports


In [1]:
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

import seaborn as sns
sns.set_style("white")

Load data


In [2]:
import util
df = util.load_burritos()
N = df.shape[0]

Linear model 1: Predict overall rating from the individual dimensions


In [3]:
# Define predictors of the model
m_lm = ['Hunger','Cost','Tortilla','Temp','Meat','Fillings','Meat:filling',
               'Uniformity','Salsa','Wrap']

# Remove incomplete data 
dffull = df[np.hstack((m_lm,'overall'))].dropna()
X = sm.add_constant(dffull[m_lm])
y = dffull['overall']
my_glm = sm.GLM(y,X)
res = my_glm.fit()
print(res.summary())
print(1 - np.var(res.resid_pearson) / np.var(y))  # variance explained (pseudo-R^2)


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                overall   No. Observations:                  127
Model:                            GLM   Df Residuals:                      116
Model Family:                Gaussian   Df Model:                           10
Link Function:               identity   Scale:                   0.13772580869
Method:                          IRLS   Log-Likelihood:                -48.564
Date:                Fri, 01 Jul 2016   Deviance:                       15.976
Time:                        12:59:51   Pearson chi2:                     16.0
No. Iterations:                     4                                         
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
const           -0.6625      0.344     -1.926      0.054        -1.337     0.012
Hunger           0.0183      0.043      0.420      0.674        -0.067     0.103
Cost             0.0233      0.034      0.693      0.488        -0.043     0.089
Tortilla         0.0916      0.053      1.736      0.083        -0.012     0.195
Temp             0.0209      0.033      0.630      0.529        -0.044     0.086
Meat             0.3042      0.050      6.039      0.000         0.206     0.403
Fillings         0.3239      0.055      5.899      0.000         0.216     0.432
Meat:filling     0.1317      0.038      3.510      0.000         0.058     0.205
Uniformity       0.0723      0.033      2.218      0.027         0.008     0.136
Salsa            0.1585      0.041      3.860      0.000         0.078     0.239
Wrap             0.0419      0.029      1.448      0.148        -0.015     0.099
================================================================================
0.759826114553
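The correlated dimensions noted above make the individual coefficients hard to interpret; one way to quantify this is with variance inflation factors (VIFs). A minimal sketch using statsmodels, with synthetic stand-in data (`meat`, `fillings`, and `cost` here are illustrative; in the notebook, the real design matrix would be `sm.add_constant(dffull[m_lm])`):

```python
# Sketch: variance inflation factors (VIFs) quantify how much each
# coefficient's variance is inflated by correlations among predictors.
# Synthetic stand-in data, not the burrito dataframe.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
n = 200
meat = rng.normal(3.5, 0.8, n)
fillings = 0.7 * meat + rng.normal(0, 0.5, n)  # correlated, like Meat vs Fillings
cost = rng.normal(7, 1, n)                     # roughly independent
X_demo = np.column_stack([np.ones(n), meat, fillings, cost])  # constant in column 0

for i, name in enumerate(['Meat', 'Fillings', 'Cost'], start=1):
    print(name, variance_inflation_factor(X_demo, i))
```

A VIF near 1 indicates an approximately independent predictor; correlated predictors (here `meat` and `fillings`) show inflated values.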

In [4]:
# Visualize coefficients
from tools.plt import bar
# Sort coefficients in descending order, dropping the intercept (index 0)
newidx = np.argsort(-res.params.values)
newidx = newidx[newidx != 0]
bar(res.params[newidx],res.bse[newidx],X.keys()[newidx],'Overall rating\nLinear model\ncoefficient',
    ylim =(0,.5),figsize=(11,3))
plt.plot()

figname = 'overall_metric_linearmodelcoef'
plt.savefig('C:/gh/fig/burrito/'+figname + '.png')


Linear model 2: Predict overall rating from ingredients

This linear model barely outperforms one built from random features (p ≈ 0.03), suggesting that a good choice of ingredients alone is not sufficient to make a high-quality burrito.


In [5]:
# Get all ingredient keys
startingredients = 29
ingredientkeys = df.keys()[startingredients:]

# Get all ingredient keys with at least 10 burritos
Nlim = 10
ingredientkeys = ingredientkeys[df.count()[startingredients:].values>=Nlim]

# Make a dataframe for all ingredients (copy to avoid SettingWithCopyWarning)
dfing = df[ingredientkeys].copy()

# Convert data to binary
for k in dfing.keys():
    dfing[k] = dfing[k].map({'x':1,'X':1,1:1})
    dfing[k] = dfing[k].fillna(0)



In [6]:
# Run a general linear model to predict overall burrito rating from ingredients
X = sm.add_constant(dfing)
y = df.overall
lm = sm.GLM(y,X)
res = lm.fit()
print(res.summary())
origR2 = 1 - np.var(res.resid_pearson) / np.var(y)


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                overall   No. Observations:                  142
Model:                            GLM   Df Residuals:                      131
Model Family:                Gaussian   Df Model:                           10
Link Function:               identity   Scale:                  0.539653838577
Method:                          IRLS   Log-Likelihood:                -151.97
Date:                Fri, 01 Jul 2016   Deviance:                       70.695
Time:                        12:59:53   Pearson chi2:                     70.7
No. Iterations:                     4                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.9062      0.210     13.866      0.000         2.495     3.317
Beef           0.3759      0.213      1.768      0.077        -0.041     0.792
Pico          -0.0310      0.150     -0.207      0.836        -0.325     0.263
Guac           0.1412      0.156      0.904      0.366        -0.165     0.447
Cheese         0.1926      0.188      1.022      0.307        -0.177     0.562
Fries          0.0844      0.191      0.441      0.659        -0.290     0.459
Sour cream     0.1400      0.174      0.806      0.420        -0.200     0.480
Pork           0.3650      0.235      1.554      0.120        -0.095     0.825
Rice          -0.0333      0.255     -0.131      0.896        -0.533     0.466
Beans         -0.3139      0.233     -1.345      0.179        -0.771     0.144
Sauce          0.2777      0.234      1.187      0.235        -0.181     0.736
==============================================================================

In [7]:
# Test if the variance explained in this linear model is significantly better than chance
np.random.seed(0)
Nsurr = 1000
randr2 = np.zeros(Nsurr)
for n in range(Nsurr):
    Xrand = np.random.rand(X.shape[0],X.shape[1])
    Xrand[:,0] = np.ones(X.shape[0])
    lm = sm.GLM(y,Xrand)
    res = lm.fit()
    randr2[n] = 1 - np.var(res.resid_pearson) / np.var(y)
print('p =', np.mean(randr2 > origR2))


p =  0.028
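The permutation test above compares in-sample fits; a complementary check is out-of-sample performance via leave-one-out cross-validation. A minimal sketch with scikit-learn, using synthetic binary "ingredient" features as a stand-in for the real `X` and `y`:

```python
# Sketch: leave-one-out cross-validated R^2 for a linear model.
# Synthetic stand-in data; in the notebook, the ingredient matrix and
# overall ratings would be used instead.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.RandomState(0)
n, p = 142, 10
X_demo = rng.binomial(1, 0.5, size=(n, p)).astype(float)  # binary "ingredients"
beta = rng.normal(0, 0.3, p)                              # weak true effects
y_demo = 3.0 + X_demo.dot(beta) + rng.normal(0, 0.7, n)

# Each burrito is predicted by a model fit on all the others
yhat = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=LeaveOneOut())
r2_cv = 1 - np.sum((y_demo - yhat) ** 2) / np.sum((y_demo - y_demo.mean()) ** 2)
print('cross-validated R^2:', r2_cv)
```

Cross-validated R² is typically lower than the in-sample value, since it penalizes overfitting to the particular sample.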

Linear model 3: Predict Yelp ratings

The same analysis can also be done for Google ratings. Note, interestingly, that the Tortilla rating is the dimension most positively correlated with Yelp and Google ratings. This effect remains significant in a linear model that accounts for the overall rating.


In [8]:
# Average each metric over each Location
# Avoid case issues; in the future should avoid article issues
df.Location = df.Location.str.lower()
m_Location = ['Location','N','Yelp','Google','Hunger','Cost','Volume','Tortilla','Temp','Meat','Fillings','Meat:filling',
               'Uniformity','Salsa','Synergy','Wrap','overall']

tacoshops = df.Location.unique()
TS = len(tacoshops)
dfmean = pd.DataFrame(np.nan, index=range(TS), columns=m_Location)
for ts in range(TS):
    dfmean.loc[ts] = df.loc[df.Location==tacoshops[ts]].mean()
    dfmean.loc[ts, 'N'] = sum(df.Location == tacoshops[ts])
dfmean.Location = tacoshops
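The per-location averaging loop above can equivalently be written with pandas `groupby`, which also yields the per-location count directly. A sketch on a small stand-in frame (`df_demo` is illustrative; in the notebook, `df` is the burrito dataframe):

```python
# Sketch: per-location averages via groupby instead of an explicit loop.
# Synthetic stand-in frame with the same lowercasing step as above.
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Location': ['taco stand', 'Taco Stand', 'lolitas', 'lolitas'],
    'overall': [4.0, 4.5, 3.0, 3.5],
    'Tortilla': [4.0, 3.5, 3.0, 4.0],
})
df_demo['Location'] = df_demo['Location'].str.lower()  # avoid case issues

grouped = df_demo.groupby('Location')
dfmean_demo = grouped.mean()      # mean of each numeric column per location
dfmean_demo['N'] = grouped.size() # number of burritos per location
print(dfmean_demo)
```

This avoids the chained-assignment pitfalls of writing into a preallocated frame row by row.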

In [16]:
# Note high correlations between features
m_Yelp = ['Google','Yelp','Hunger','Cost','Volume','Tortilla','Temp','Meat','Fillings','Meat:filling',
               'Uniformity','Salsa','Synergy','Wrap','overall']
M = len(m_Yelp)
dfmeancorr = dfmean[m_Yelp].corr()

from matplotlib import cm
clim1 = (-1,1)
plt.figure(figsize=(10,10))
cax = plt.pcolor(range(M+1), range(M+1), dfmeancorr, cmap=cm.bwr)
cbar = plt.colorbar(cax, ticks=(-1,-.5,0,.5,1))
cbar.ax.set_ylabel('Pearson correlation (r)', size=30)
plt.clim(clim1)
cbar.ax.set_yticklabels((-1,-.5,0,.5,1),size=20)
ax = plt.gca()
ax.set_yticks(np.arange(M)+.5)
ax.set_yticklabels(m_Yelp,size=25)
ax.set_xticks(np.arange(M)+.5)
ax.set_xticklabels(m_Yelp,size=25)
plt.xticks(rotation='vertical')
plt.xlim((0,M))
plt.ylim((0,M))
plt.tight_layout()



In [10]:
# GLM for Yelp: all dimensions
m_Yelp = ['Hunger','Cost','Tortilla','Temp','Meat','Fillings','Meat:filling',
               'Uniformity','Salsa','Synergy','Wrap','overall']
dffull = dfmean[np.hstack((m_Yelp,'Yelp'))].dropna()
X = sm.add_constant(dffull[m_Yelp])
y = dffull['Yelp']
my_glm = sm.GLM(y,X)
res = my_glm.fit()
print(res.summary())
print(res.pvalues)
print(1 - np.var(res.resid_pearson) / np.var(y))  # variance explained (pseudo-R^2)


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                   Yelp   No. Observations:                   38
Model:                            GLM   Df Residuals:                       25
Model Family:                Gaussian   Df Model:                           12
Link Function:               identity   Scale:                  0.168811206867
Method:                          IRLS   Log-Likelihood:                -12.164
Date:                Fri, 01 Jul 2016   Deviance:                       4.2203
Time:                        12:59:56   Pearson chi2:                     4.22
No. Iterations:                     4                                         
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
const            3.7948      0.956      3.968      0.000         1.921     5.669
Hunger          -0.2101      0.171     -1.232      0.218        -0.544     0.124
Cost             0.0053      0.088      0.060      0.952        -0.167     0.177
Tortilla         0.1341      0.168      0.797      0.426        -0.196     0.464
Temp            -0.0336      0.100     -0.337      0.736        -0.229     0.162
Meat            -0.0106      0.227     -0.047      0.963        -0.455     0.434
Fillings         0.0650      0.244      0.266      0.790        -0.413     0.543
Meat:filling    -0.1166      0.160     -0.728      0.466        -0.431     0.197
Uniformity      -0.0871      0.120     -0.728      0.466        -0.321     0.147
Salsa           -0.1132      0.148     -0.767      0.443        -0.403     0.176
Synergy          0.0962      0.251      0.383      0.702        -0.396     0.589
Wrap            -0.0622      0.094     -0.664      0.507        -0.246     0.121
overall          0.3626      0.458      0.792      0.428        -0.534     1.260
================================================================================
const           0.000072
Hunger          0.217839
Cost            0.952162
Tortilla        0.425734
Temp            0.736197
Meat            0.962724
Fillings        0.790052
Meat:filling    0.466482
Uniformity      0.466461
Salsa           0.443130
Synergy         0.702063
Wrap            0.506969
overall         0.428189
dtype: float64
0.355864643444

In [11]:
# GLM for Yelp: some dimensions
m_Yelp = ['Tortilla','overall']
dffull = dfmean[np.hstack((m_Yelp,'Yelp'))].dropna()
X = sm.add_constant(dffull[m_Yelp])
y = dffull['Yelp']
my_glm = sm.GLM(y,X)
res = my_glm.fit()
print(res.summary())


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                   Yelp   No. Observations:                   40
Model:                            GLM   Df Residuals:                       37
Model Family:                Gaussian   Df Model:                            2
Link Function:               identity   Scale:                  0.155550854964
Method:                          IRLS   Log-Likelihood:                -17.983
Date:                Fri, 01 Jul 2016   Deviance:                       5.7554
Time:                        12:59:56   Pearson chi2:                     5.76
No. Iterations:                     4                                         
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          3.0184      0.305      9.911      0.000         2.421     3.615
Tortilla       0.2845      0.138      2.060      0.039         0.014     0.555
overall        0.0030      0.135      0.023      0.982        -0.262     0.268
==============================================================================

In [12]:
plt.figure(figsize=(4,4))
ax = plt.gca()
dfmean.plot(kind='scatter',x='Tortilla',y='Yelp',ax=ax,**{'s':40,'color':'k','alpha':.3})
plt.xlabel('Average Tortilla rating',size=20)
plt.ylabel('Yelp rating',size=20)
plt.xticks(np.arange(0,6),size=15)
plt.yticks(np.arange(0,6),size=15)
plt.ylim((2,5))
plt.tight_layout()
print(sp.stats.spearmanr(dffull.Yelp, dffull.Tortilla))

figname = 'corr-Yelp-tortilla'
plt.savefig('C:/gh/fig/burrito/'+figname + '.png')


SpearmanrResult(correlation=0.54662170394611254, pvalue=0.00026297556035512708)
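With only ~40 locations, a bootstrap confidence interval gives a sense of the uncertainty in the Spearman correlation reported above. A minimal sketch with synthetic stand-in data (in the notebook, `dffull.Tortilla` and `dffull.Yelp` would be resampled instead):

```python
# Sketch: bootstrap 95% CI for a Spearman correlation.
# Synthetic stand-in data with a positive relationship, like the
# Tortilla-vs-Yelp result above.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
n = 40
tortilla = rng.normal(3.5, 0.7, n)
yelp = 0.3 * tortilla + rng.normal(0, 0.3, n)

nboot = 1000
boot = np.zeros(nboot)
for b in range(nboot):
    idx = rng.randint(0, n, n)  # resample locations with replacement
    boot[b], _ = stats.spearmanr(tortilla[idx], yelp[idx])
lo, hi = np.percentile(boot, [2.5, 97.5])
print('95%% CI for Spearman r: (%.2f, %.2f)' % (lo, hi))
```

If the interval excludes zero, the positive association is robust to resampling of locations.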

In [ ]: